Method of noise reduction in speech signals and an apparatus for performing the method

ABSTRACT

A method and apparatus for noise reduction in a speech signal wherein a first spectrum is generated on the basis of the speech signal and a second spectrum is generated as an estimate of the noise power spectrum. A third spectrum is generated by performing a spectral subtraction of the first and second spectra, and a resulting speech signal is generated on the basis of the third spectrum. A model-based representation describing the quasi-stationary part of the speech signal is generated on the basis of the third spectrum. The model-based representation is manipulated, and the resulting speech signal is generated using the manipulated model-based representation and a second signal derived from the speech signal.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to noise reduction in speech signals.

2. The Prior Art

Noise, when added to a speech signal, can impair the quality of the signal, reduce intelligibility, and increase listener fatigue. It is therefore of great importance to reduce noise in a speech signal in relation to hearing aids, but also in relation to telecommunication.

Various methods of noise reduction in a speech signal are known. These methods include spectral subtraction and other filtering methods, e.g., Wiener filtering. Spectral subtraction is a technique for reducing noise in speech signals, which operates by converting a time domain representation of the speech signal into the frequency domain, e.g., by taking the Fourier transform of segments of the speech signal. Hereby a set of signals representing the short term power spectrum of the speech is obtained. During the speech-free periods, an estimate of the noise power spectrum is generated. The obtained noise power spectrum is subtracted from the speech power spectrum signals in order to obtain a noise reduction. A time domain speech signal is reconstructed using the resulting spectrum, e.g., by use of the inverse Fourier transform. Hereby the time-domain signal is reconstructed from the noise-reduced power spectrum and the unmodified phase spectrum.

Even though this method has been found to be useful, it has the drawback that the noise reduction is based on an estimate of the noise spectrum and is therefore dependent on stationarity in the noise signal to perform optimally.

As the noise in a speech signal is often non-stationary, the estimated noise spectrum used for spectral subtraction will be different from the actual noise spectrum during speech activity. This error in noise estimation tends to affect small spectral regions of the output, and will result in short duration random tones in the noise reduced signal. Even though these random noise tones are often a low-energy signal compared to the total energy in the speech signal, the random tone noise tends to be very irritating to listen to due to psycho-acoustic effects.

The object of the invention is to provide a method which enables noise reduction in a speech signal, and which avoids the above-mentioned drawbacks of the prior art.

SUMMARY OF THE INVENTION

The invention is based on the circumstance that a model-based representation describing the quasi-stationary part of the speech signal can be generated on the basis of a third spectrum, which is generated by spectral subtraction of a first spectrum generated on the basis of a speech signal and a second spectrum generated as an estimate of the noise power spectrum. The spectral subtraction enables the use of a model-based representation for speech signals including noise, and the model-based representation of the quasi-stationary part of the speech signal enables an improved noise reduction compared to methods of the prior art, as it enables use of a priori knowledge of speech signals.

This unconventional use of a combination of both traditional and model-based methods of noise reduction in a speech signal is advantageous, as it permits smooth manipulation of the speech signal in order to obtain improved noise reduction without artefacts.

As the model based representation is generated dynamically, i.e., on the fly, movements of the formants in the third spectrum will not affect the quality of the noise reduction, and the method according to the invention is therefore advantageous compared to methods of the prior art.

Preferably, the model-based representation can include parameters describing one or more formants in the third spectrum. This is advantageous as the formants in the third spectrum, i.e., the peaks in the signal spectrum which are related to the speech, contain essential features of the speech signal, and as it is possible to manipulate the formants by using the parameters, and hereby to manipulate the resulting speech signal.

The parameters preferably reflect the resonance frequency, the bandwidth, and the gain at the resonance frequency of the formants in the third spectrum.

In a preferred embodiment, the manipulation can include spectral gaining, which is based on a structure parameter reflecting structure in the spectrum. Spectral gaining attenuates relatively broad formants since these cause unwanted artefacts. This method is based on the fact that man-made speech produces narrow formants in the absence of noise.

The structure parameter S can preferably be given by S=B*G, where B is the bandwidth ratio of the formants in the third spectrum, and G is the gain ratio of the formants in the third spectrum.

Noise reduction is preferably performed in said second signal. This is advantageous as noise will also be present in the second signal, and a noise reduction in this signal will therefore result in a noise reduction in the resulting signal.

The second signal can correspond to the speech signal. This is advantageous in some cases, e.g., when the signal/noise ratio approximately equals 0 dB.

The second signal can represent the residual signal, i.e., the non-stationary part of the speech signal such as information reflecting the articulation. This is advantageous in some cases, e.g., when the signal/noise ratio approximately equals 6 dB.

Various signal elements of the second signal, such as pitch pulses, stop consonants and noise transients, can preferably be amplified or attenuated. This is advantageous in some cases, e.g., when the signal/noise ratio approximately equals −6 dB.

The present invention also relates to an apparatus for noise reduction in speech signals.

The invention will be explained more fully by the following description with reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows a schematic diagram of prior art;

FIG. 2 shows a schematic diagram of one preferred embodiment of the present invention;

FIG. 3 illustrates some formants of a speech signal along with some parameters describing one formant;

FIG. 4a shows the dependency between the structure parameter, STRUK, and the bandwidth threshold;

FIG. 4b shows the gain attenuation factor as a function of the bandwidth threshold;

FIG. 5a is a block diagram of an apparatus utilizing the method according to the invention; and

FIG. 5b shows some aspects of FIG. 5a in greater detail.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The prior art is described with reference to FIG. 1. The figure illustrates an apparatus where a speech signal S is connected to the input terminal of a spectrum generating means 1. The output terminal of the spectrum generating means 1 is connected to a spectral subtraction means 5. A measured noise signal N is connected to the input terminal of a noise spectrum generating means 2. The output terminal of the noise spectrum generating means 2 is connected to a second input terminal of the spectral subtraction means 5. The output terminal of the spectral subtraction means 5 is connected to the input terminal of a signal generating means 9. The signal generating means 9 is adapted to generate the resulting speech signal RS, which is connected to the output terminal.

At 1, segments of the speech signal including noise, S, in the time domain are transformed into a representation in the frequency domain, e.g., by use of the FFT (Fast Fourier Transform). During speech-free periods an estimate of the noise power spectrum is calculated from a background noise signal, N, and stored at 2. At 5, the estimate of the noise power is then subtracted from the spectral representation of the speech signal. The result is yet another spectrum with a reduced amount of noise, provided that a good estimate of the noise power spectrum could be obtained and that the background noise has not changed much since the estimate was made. This procedure is often called ‘Spectral Subtraction’. The resulting spectrum is then transformed back into the time domain at 9, e.g., by the inverse FFT, thereby generating the resulting speech signal, RS.
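The prior-art chain of FIG. 1 can be sketched for a single frame as follows. This is a minimal illustration, not the implementation of any particular product: it assumes NumPy, a real-valued frame, clamping of negative differences to a floor, and reuse of the noisy phase for reconstruction. The function names `estimate_noise_power` and `spectral_subtraction` are hypothetical.

```python
import numpy as np

def estimate_noise_power(noise_frames):
    """Average power spectrum over frames taken from speech-free periods (block 2)."""
    return np.mean([np.abs(np.fft.rfft(f)) ** 2 for f in noise_frames], axis=0)

def spectral_subtraction(frame, noise_power, floor=0.0):
    """Power spectral subtraction on one frame (blocks 1, 5 and 9 of FIG. 1)."""
    spectrum = np.fft.rfft(frame)                      # block 1: FFT of the noisy frame
    power = np.abs(spectrum) ** 2                      # short-term power spectrum
    cleaned = np.maximum(power - noise_power, floor)   # block 5: subtract, clamp (assumption)
    magnitude = np.sqrt(cleaned)
    phase = np.angle(spectrum)                         # unmodified (noisy) phase
    # block 9: inverse FFT back to the time domain
    return np.fft.irfft(magnitude * np.exp(1j * phase), n=len(frame))
```

With a zero noise estimate the frame is returned unchanged, which is a convenient sanity check. The clamping step is where a poor noise estimate leaves isolated spectral peaks, i.e. the short random tones described in the prior-art discussion.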

FIG. 2 schematically shows an improved method according to a preferred embodiment of the present invention. The figure illustrates an apparatus according to the invention, where a speech signal S is connected to the input terminal of a spectrum generating means 12. The output from the spectrum generating means 12 is connected to a first input terminal of a spectral subtraction means 15. The apparatus also includes a noise spectrum generating means 10 having an input terminal, which is connected to a measured noise signal N, and an output terminal, which is connected to a second input terminal of the spectral subtraction means 15. As shown in the figure, the apparatus also includes a model generating means 17, a model manipulating means 18, and a signal generating means 19, which are connected in series. A second signal generating means 14 has an input terminal, which is also connected to the speech signal, and an output terminal which is connected to a second input terminal of the signal generating means 19. The signal generating means 19 is adapted to generate the resulting speech signal RS.

At 10, an estimate of the noise power spectrum is calculated from a background noise signal, N, during speech-free periods. The estimate is stored for later use. This estimated spectrum is called the second spectrum hereinafter. At 12, segments of the speech signal including noise, S, in the time domain are transformed into a spectral representation in the frequency domain, e.g., by the FFT. This spectrum is called the first spectrum hereinafter. The second spectrum is then subtracted from the first spectrum at 15, resulting in a noise-reduced spectrum, called the third spectrum hereinafter. This result is not always sufficient or satisfactory, as mentioned above. So, in accordance with this invention, the third spectrum is used for generating a model based description of the speech signal. This is done at 17. The spectral subtraction reduces the noise, thereby enabling the use of a model based description in noisy environments to gain even greater noise reduction.

The model based description ensures simple control of formants, and thereby the essential features of the speech signal, through parameters like the resonance frequency (f), the bandwidth (b) and the gain (g) of each formant (see also FIG. 3). The model can be derived using known methods, e.g. the method used in the Partran Tool, which is described in articles by U. Hartmann, K. Hermansen and F.K. Fink: “Feature extraction for profoundly deaf people”, D.S.P. Group, Institute for Electronic Systems, Alborg University, September 1993, and by K. Hermansen, P. Rubak, U. Hartman and F. K. Fink: “Spectral sharpening of speech signals using the partran tool”, Alborg University.

These three parameters, f, b, and g, for each relevant formant capture all the essential features of the quasi-stationary part of a speech signal. These parameters are manipulated at 18 in order to reduce artefact sounds, e.g. “bath tub” sounds, and to reduce the noise even further. Artefacts are distorted sounds with a low signal power and will typically not be removed by any methods according to the prior art. However, these sounds have been found to be very disturbing and irritating to the human ear, which is well-known from various psycho-acoustic tests. The manipulated parameters are then used together with a signal S₂, which is derived from the original speech signal at 14, in order to obtain a time varying speech signal with reduced noise and artefacts. The resulting f, b, and g parameters are used to form the pulse response for the synthesis filter 19. Convolution of signal S₂ and said pulse response forms the resulting speech signal RS.

FIG. 3 illustrates the relation between the individual formants and the parameters f, b and g in greater detail.

In the spectrum of a human speech signal, formants will always be present in the absence of noise. Typically the largest formant, which is also the most important with respect to intelligibility, lies at the lowest frequency, while the additional formants typically have a decreasing amplitude as their resonance frequency increases. The fact that the largest formant carries much of the relevant information enables a human being to understand the speech even if all the other formants have “drowned” in noise.

Human speech incorporates a given structure for physiological reasons, whereas ‘ordinary’ background noise (e.g., white or pink noise) is highly disorganized/unstructured (a spectrum showing “ordinary” background noise, e.g., white noise, would consist of all frequencies present with more or less the same amplitude). A parameter reflecting the structure of a given sound/speech can therefore characterize the amount of noise present in that particular sound/speech. If the sound/speech incorporates a high level of structure, then the signal does not contain much noise, since noise is unstructured. A parameter is used in order to describe the structure in the speech signal. The one disclosed in this embodiment has been found to be a good and reliable choice. This choice is one of perhaps many and should not limit the present invention. The parameter used in this invention is called STRUK and is defined as: $\mathrm{STRUK} = \frac{\max(b_i)}{\min(b_i)} \cdot \frac{\max(g_i)}{\min(g_i)}$

that is, the ratio of the maximum to the minimum value of all of the bandwidths for the available formants multiplied by the ratio of the maximum to the minimum value of all of the gain values for the available formants. In this particular embodiment b is given at the 3 dB attenuation from the resonance frequency and g is given at the resonance frequency. Other choices will be apparent to one skilled in the art. The basic idea of spectral gaining is to “punish” great bandwidths, as such are indicators of a missing structure. If STRUK is large (e.g., 100), the spectrum holds little noise, and if STRUK is relatively small (e.g., 5) the spectrum holds much noise.
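As a small worked illustration of the definition above, the structure parameter can be computed directly from the formant bandwidths and gains. The function name `struk` and the requirement of strictly positive, linear-scale gains are assumptions made for this sketch.

```python
def struk(bandwidths, gains):
    """Structure parameter: (max b / min b) * (max g / min g) over the available formants.

    Assumes one bandwidth and one strictly positive linear gain per formant.
    """
    if not bandwidths or len(bandwidths) != len(gains):
        raise ValueError("need one bandwidth and one gain per formant")
    if min(bandwidths) <= 0 or min(gains) <= 0:
        raise ValueError("bandwidths and gains must be positive")
    bandwidth_ratio = max(bandwidths) / min(bandwidths)   # B in S = B*G
    gain_ratio = max(gains) / min(gains)                  # G in S = B*G
    return bandwidth_ratio * gain_ratio
```

For example, two formants with bandwidths of 100 Hz and 400 Hz and gains of 1 and 5 give a bandwidth ratio of 4 and a gain ratio of 5, hence STRUK = 20.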

FIG. 3 shows two formants (the two to the left) with a resulting model description together with two other formants (the two to the right) that are ‘drowned’ in noise. Due to the fact described above, the model description will be perceived as quite good even though only two formants are included in the model. This makes the method according to the present invention robust.

The parameter STRUK gives an easily modifiable one-valued parameter to determine the level of noise still present in the third spectrum. The model description makes it easy to modify the spectrum in order to remove unwanted artefacts and noise. This is done through the complete control of the parameters describing the formants (f, b and g). One way to reduce the noise is by ‘punishing’ formants with a relatively broad bandwidth by attenuating these, since it is in the nature of man-made sound that the formants are relatively narrow. The attenuation is done by using the parameter STRUK and the two relations shown in FIGS. 4a and 4b, which show a bandwidth threshold as a function of STRUK (FIG. 4a) and the gain attenuation as a function of the bandwidth threshold (FIG. 4b). Here it is shown that for a large value of STRUK (little noise) the bandwidth threshold is relatively large (e.g., 400 Hz), and thus the gain attenuation only attenuates relatively broad formants. For a small value of STRUK (much noise) the bandwidth threshold is relatively small (e.g., 200 Hz) and the gain attenuation attenuates formants even when they are not very broad. That broad formants are attenuated can be seen in FIG. 3. Often the low frequency formants will survive the attenuation, which is desirable since these contain the most information relevant to the human ear, while the broad formants are removed in the process, which is desirable as well since these broad formants will often be perceived as artefacts by the human ear.
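The two relations of FIGS. 4a and 4b are only given graphically, so the following is a sketch under stated assumptions rather than the curves of the figures. It assumes a threshold that rises linearly from 200 Hz at STRUK = 5 to 400 Hz at STRUK = 100 (the example values quoted in the text) and is clamped outside that range, and it assumes a hypothetical attenuation law `threshold / bandwidth` for formants broader than the threshold. Both function names are illustrative.

```python
def bandwidth_threshold(struk, struk_low=5.0, struk_high=100.0,
                        bw_low=200.0, bw_high=400.0):
    """Bandwidth threshold in Hz as a function of STRUK (assumed piecewise linear)."""
    if struk <= struk_low:
        return bw_low            # much noise: small threshold
    if struk >= struk_high:
        return bw_high           # little noise: large threshold
    t = (struk - struk_low) / (struk_high - struk_low)
    return bw_low + t * (bw_high - bw_low)

def gain_factor(bandwidth, threshold):
    """Gain multiplier for one formant (assumed law, not taken from FIG. 4b)."""
    if bandwidth <= threshold:
        return 1.0               # narrow formants pass unchanged
    return threshold / bandwidth # broad formants are "punished"
```

Under these assumptions a formant of 800 Hz bandwidth is halved in gain when the threshold is 400 Hz, while the same formant would be reduced to a quarter when heavy noise lowers the threshold to 200 Hz.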

Again, the model based approach with its small number of parameters ensures that a modification can be quite simple in order to obtain a noise reduction and/or artefact removal. The model based approach further has the advantage that if one has to transmit a speech signal, then the amount of data needed is greatly reduced by only having a small number of parameters describing the formants and thereby the speech signal.

FIG. 5a illustrates an apparatus according to the invention, where a speech signal is connected to the input terminal of pre-emphasizing means 50. The output terminal is connected to the input terminals of Hamming weighting signal means 52 and inverse LPC analysis/filtering means 58, and to a first input terminal of the synthesis filter 74 and the post-emphasizing means 79, which is adapted to compensate for the effect of the pre-emphasizing means 50 mentioned previously. The output terminal of the Hamming weighted signal means 52 is connected in series to the spectrum generating means 60, diode-rectifying means 62, spectral subtraction means 69, effect means 66, autocorrelation means 68, LPC model parameters determination means 70, the functional block 76, and to a second input terminal of the synthesis filter 74, and also to the input terminal of the autocorrelation means 54. The output terminal of the autocorrelation means 54 is connected to LPC model parameters determination means 56. The LPC model parameters are connected to the inverse LPC analysis/filtering means 58. The apparatus further comprises a pitch detection means 72 with an input terminal and an output terminal connected to the output terminal of the inverse LPC analysis/filtering means 58 and to a third input terminal of the synthesis filter 74, respectively. The synthesis filter 74 is adapted to select an input signal from one of the input terminals dependent on the noise level. The selected signal is called the second signal hereinafter. The selection can be performed in several ways. Noise reduction means can be used in order to obtain additional noise reduction in said second signal using known methods if desired.

FIG. 5b illustrates in greater detail the functional block 76, where the input signal is connected in series to: pseudo decomposition means 77, spectral gaining means 78, spectral sharpening means 80 and pseudo composition means 82.

FIGS. 5a and 5b illustrate a block diagram of an apparatus utilizing the described method. The signal to be processed is given as x=s+n, where s and n are the signal and noise components, respectively. The signal is pre-emphasized at 50 in order to emphasize high-frequency signal components, so that the important information present in these relatively low-power components can be accessed.

The basis for an improvement in the SNR (signal to noise ratio) of an observed signal is the presence of one observed signal (from one microphone). The separation of the signal component and the noise component must thus be based on some knowledge of the signal component as well as the noise component. The overall idea of the invention is the utilization of the inertia conditioned partial stationarity of man-made sounds, as regards both articulation and intonation. The additive noise component, n, is assumed to be “white”, pink or a combination thereof, and partly stationary in the second order statistics, but does not contain stationary harmonic components.

The basic approach is a separation of the articulation and intonation components via inverse LPC analysis/filtering 58. This ensures that the residual signal becomes maximally “white” and just contains, in terms of information, intonation components whose variation is assumed to be partly stationary, as mentioned before.
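Inverse LPC filtering can be illustrated with a few lines of code. The sketch below assumes the common convention that the LPC polynomial is A(z) = 1 + a1·z⁻¹ + … + ap·z⁻ᵖ and that samples before the start of the signal are zero; the function name `inverse_filter` is hypothetical.

```python
def inverse_filter(signal, lpc):
    """Residual e[n] = sum_k lpc[k] * x[n-k], with lpc[0] == 1 (FIR whitening filter)."""
    residual = []
    for n in range(len(signal)):
        acc = 0.0
        for k, a_k in enumerate(lpc):
            if n - k >= 0:                 # samples before the signal start count as zero
                acc += a_k * signal[n - k]
        residual.append(acc)
    return residual
```

If a signal was generated by x[n] = 0.5·x[n−1] + e[n], filtering it with the coefficients [1, −0.5] returns the excitation e[n], which is the sense in which the residual carries the intonation (excitation) information while the filter carries the articulation.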

The determination of the articulation components depends on the strength of the noise, a distinction being made between three stages, viz. weak, intermediate and strong noise corresponding to an SNR of +6 dB, 0 dB and −6 dB, respectively.

For weak noise, the model parameters (LPC) 56 are determined on the basis of the autocorrelation function derived directly from the Hamming weighted signal 52 by the autocorrelation means 54, and non-linear spectral gaining is performed (see the following) in the spectral gaining means 78 according to the PARTRAN concept, see EP publication no. 0 796 489.

For the intermediate and strong noise situation, an indirect method is used for the determination of the autocorrelation function, which is still the basis for the model based description of articulation.

The indirect determination of the autocorrelation function is based on the relationship between power spectrum and autocorrelation (they are the Fourier transforms of each other). The Hamming weighted signal is Fourier-transformed with 512 points at 60 and diode-rectified at 62 with a given time constant. The minimum value of this signal is determined and subtracted from the diode-rectified amplitude spectrum, thereby generating an amplitude spectral subtracted spectrum 64. (Where the appearance of the noise spectrum is known a priori, arbitrary noise spectra may be subtracted here. This knowledge may be obtained if it is possible to identify phases in which the signal component is not present.) Following squaring, which is performed by the effect means 66, the spectrum is inverse-Fourier-transformed with a view to determining the autocorrelation function 68. From the autocorrelation, the LPC coefficients can be determined at 70. These coefficients are used in a pseudo decomposition 77 in order to identify the f, b and g parameters. Then non-linear spectral gaining 78 is performed according to the PARTRAN concept, followed by spectral sharpening 80 and pseudo composition 82, in order to obtain a spectrum from the model based description.
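The indirect route from spectrum to LPC coefficients can be sketched as below. This is an illustration under assumptions, not the apparatus itself: it uses NumPy, processes a single frame, omits the diode rectification with its time constant (which would smooth the amplitude spectrum across frames), and determines the LPC coefficients with the standard Levinson-Durbin recursion, which the text does not name. The function names and the default model order of 10 are hypothetical.

```python
import numpy as np

def indirect_autocorrelation(frame, n_fft=512, order=10):
    """Autocorrelation lags 0..order obtained via the spectrum (Wiener-Khinchin)."""
    windowed = frame * np.hamming(len(frame))          # Hamming weighting (52)
    amplitude = np.abs(np.fft.fft(windowed, n_fft))    # 512-point amplitude spectrum (60)
    subtracted = amplitude - amplitude.min()           # subtract the minimum value (64)
    power = subtracted ** 2                            # squaring (66)
    autocorr = np.fft.ifft(power).real                 # inverse transform (68)
    return autocorr[:order + 1]

def levinson_durbin(r, order):
    """Solve for A(z) = 1 + a1 z^-1 + ... from autocorrelation lags r[0..order]."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])     # r[i] + sum_j a[j] * r[i-j]
        k = -acc / err                                 # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k                             # updated prediction error
    return a, err
```

Because the subtracted spectrum is squared before the inverse transform, the lag-zero value is never smaller in magnitude than the other lags. Note that a frame longer than half the transform length makes the result a circular rather than a linear autocorrelation.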

In all three cases of noise a model based (LPC) description of the articulation is provided. This model spectrum forms the basis for the calculation of the characteristic parameters of the energy maxima, viz. f, b and g parameters for each formant.

In connection with the weighting of these energy maxima a control parameter STRUK is developed (see above), indicating the degree of structure in the observed signal. This parameter is used for spectral gaining 78 according to the PARTRAN concept (see EP publ. no. 0 796 489).

The bandwidth threshold for reduction in the gain is controlled by the parameter STRUK as mentioned above.

The bandwidth threshold changes linearly in the region “intermediate”. Each energy maximum is now subjected to gain adjustment depending on the current bandwidth and the current bandwidth threshold.

Artefacts in the form of the well-known “bath tub sounds” are eliminated hereby. After spectral gaining 78, spectral sharpening 80 is performed, comprising adjusting the bandwidth of the energy maxima by the factor band fact.

The thus modified f, b and g parameters (f being unchanged here) are used for forming second order resonators with zero points positioned in Z=1 and Z=−1. The pulse responses of these resonators, coupled in parallel and with alternating signs, are used as FIR filter coefficients in the synthesis filter 74 (4-fold interpolation is performed).
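One way to picture the resonator bank is the sketch below. It assumes a conventional second-order resonator with pole radius r = exp(−π·b/fs) and pole angle 2π·f/fs, numerator 1 − z⁻² (zeros at Z=1 and Z=−1), and a simple scaling of each pulse response by g. The exact gain normalization and the 4-fold interpolation are not reproduced, and the function names are hypothetical.

```python
import math

def resonator_impulse_response(f, b, g, fs, length):
    """Truncated pulse response of g * (1 - z^-2) / (1 + a1 z^-1 + a2 z^-2)."""
    r = math.exp(-math.pi * b / fs)            # pole radius from bandwidth (assumption)
    theta = 2.0 * math.pi * f / fs             # pole angle from resonance frequency
    a1, a2 = -2.0 * r * math.cos(theta), r * r
    h = []
    for n in range(length):
        x = (1.0 if n == 0 else 0.0) - (1.0 if n == 2 else 0.0)   # numerator 1 - z^-2
        y1 = h[n - 1] if n >= 1 else 0.0
        y2 = h[n - 2] if n >= 2 else 0.0
        h.append(x - a1 * y1 - a2 * y2)
    return [g * v for v in h]

def synthesis_fir(formants, fs, length=64):
    """Sum the resonator pulse responses in parallel with alternating signs."""
    taps = [0.0] * length
    for i, (f, b, g) in enumerate(formants):
        sign = 1.0 if i % 2 == 0 else -1.0
        h = resonator_impulse_response(f, b, g, fs, length)
        taps = [t + sign * v for t, v in zip(taps, h)]
    return taps
```

The zeros at Z=1 and Z=−1 mean that each resonator has no response at 0 Hz or at half the sampling frequency, so for a sufficiently long pulse response both the plain sum and the alternating-sign sum of its samples are close to zero.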

Input signals to the synthesis filter 74 depend on the degree of the noise, a distinction being made here again between weak, intermediate and strong noise.

For weak noise, the residual signal from the inverse filtering 58 is used.

For intermediate noise, the input signal to the inverse filter 58 is used (the pre-emphasized observed signal). This results in a natural/inherent spectral sharpening, beyond the one currently performed in the PARTRAN transposition.

In case of strong noise, the jitter on the pulse of the residual signal is of such a nature/size that none of the above signals can be used as input to the synthesis filter 74. Here use is made of the fact that the intonation of man-made sounds is partly stationary, which is utilized in a modified pitch detection 72 based on a long observation window. A voiced sound detection determines whether pitch is present, and if so, a residual signal consisting of unit pulses of mean spacing is phased in.
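The replacement residual for the strong-noise case can be sketched as a train of unit pulses placed at the mean pitch period. The sketch assumes that pitch periods, in samples, have already been measured over the long observation window, treats an empty list as "no pitch detected", and leaves out the gradual phasing-in. The function name `unit_pulse_residual` is hypothetical.

```python
def unit_pulse_residual(pitch_periods, length):
    """Unit pulses spaced by the mean of the measured pitch periods (in samples)."""
    if not pitch_periods:
        return [0.0] * length                 # unvoiced: no pulses are inserted
    spacing = max(1, round(sum(pitch_periods) / len(pitch_periods)))
    residual = [0.0] * length
    for n in range(0, length, spacing):
        residual[n] = 1.0                     # jitter-free pulse positions
    return residual
```

With measured periods of 78, 80 and 82 samples the pulses land every 80 samples, so the period-to-period jitter of the noisy residual does not reach the synthesis filter.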

As a result, the jitter is reduced significantly, and the synthesized signal is less corrupted by noise.

The basic idea of the described method is to focus on quasi-stationary components in the observed signal. The method identifies these components and “locks” to them as long as they have a suitable strength and stationarity. This applies to both articulation and intonation components. Generally, artefacts are avoided hereby in connection with the filtering of the noise components. Many psycho-acoustic tests indicate that humans use related methods, inter alia in noisy environments.

As mentioned before, the method has been developed on the assumption of one observed signal. In the situation where two or more microphones are possible, this per se can give a noise reduction for the noise components in the two signals which correlate with each other. The remaining noise components may subsequently be eliminated via the described method.

Although a preferred embodiment of the present invention has been described and shown, the invention is not limited to it, but may also be embodied in other ways within the scope of the subject-matter defined in the appended claims, for example increase in speech intelligibility/speech comfort by manipulation/weighting of the formants in accordance with their strength/frequency or elimination of speaker dependent components in the speech signal, while maintaining speech intelligibility (speaker scrambling/encryption).

I claim:
 1. A method of noise reduction in a speech signal, wherein a first spectrum is generated on the basis of the speech signal, a second spectrum is generated as an estimate of the noise power spectrum, a third spectrum is generated by performing a spectral subtraction of said first and second spectra, and a resulting speech signal is generated on the basis of said third spectrum, and wherein a model based representation describing the quasi-stationary part of the speech signal is generated on the basis of the third spectrum, said model based representation is manipulated, and said resulting speech signal is generated using said manipulated model based representation and a second signal derived from said speech signal.
 2. A method according to claim 1, wherein said model based representation includes parameters describing one or more formants in said third spectrum.
 3. A method according to claim 2, wherein said parameters reflect the resonance frequency, the bandwidth, and the gain at the resonance frequency of said formants in said third spectrum.
 4. A method according to claim 1, wherein said manipulation includes spectral gaining, which is based on a structure parameter S reflecting the structure in the spectrum.
 5. A method according to claim 4, wherein said structure parameter S is given by S=B*G, where B is the bandwidth ratio of said formants in said third spectrum, and G is the gain ratio of said formants in said third spectrum.
 6. A method according to claim 1, wherein noise reduction is performed in said second signal.
 7. A method according to claim 1, wherein said second signal corresponds to said speech signal.
 8. A method according to claim 1, wherein said second signal represents the residual signal.
 9. A method according to claim 8, wherein various signal elements of said second signal, such as pitch pulses, stop consonants and noise transients, are amplified or attenuated.
 10. An apparatus for noise reduction in a speech signal, comprising spectrum generating means (1,12) adapted to generate a first spectrum on the basis of the speech signal, noise spectrum generating means (2,10) adapted to generate a second spectrum as an estimate of the noise power spectrum, spectral subtraction means (5,15) adapted to generate a third spectrum by performing spectral subtraction of said first and second spectra, and signal generating means (9,19) adapted to generate a resulting speech signal on the basis of said third spectrum, said apparatus further comprising: model generating means (17) adapted to generate a model based representation describing the quasi-stationary part of the speech signal on the basis of the third spectrum, model manipulating means (18) adapted to manipulate said model based representation, a second signal generating means (14) adapted to derive a second signal from said speech signal, and wherein said signal generating means (19) generates the resulting speech signal using said manipulated model based representation and second signal.
 11. An apparatus according to claim 10, wherein said model generating means (17) generates a model which includes parameters describing one or more formants in said third spectrum.
 12. An apparatus according to claim 11, wherein said parameters reflect the resonance frequency, the bandwidth, and the gain at the resonance frequency of said formants in said third spectrum.
 13. An apparatus according to claim 10, wherein said model manipulating means (18) forms a structure parameter S which reflects the structure in the spectrum, and performs spectral gaining based on said structure parameter S.
 14. An apparatus according to claim 13, wherein said structure parameter S is given by S=B*G, where B is the bandwidth ratio of said formants in said third spectrum, and G is the gain ratio of said formants in said third spectrum.
 15. An apparatus according to claim 10, wherein the apparatus further comprises noise reduction means which performs noise reduction in said second signal.
 16. An apparatus according to claim 10, wherein said speech signal is used as said second signal.
 17. An apparatus according to claim 10, wherein the residual signal is used as said second signal.
 18. An apparatus according to claim 17, further comprising means (72) to amplify or attenuate various signal elements of said second signal, such as pitch pulses, stop consonants and noise transients.